Since the dataset from the previous task contains only 4 usable columns, it is not suitable for factor analysis. We will therefore use the geopol dataset instead. It contains economic, demographic and health information about 41 countries across the world.
Let us now perform a factor analysis of this dataset. After a brief exploration, we decided to use 4 factors.
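fa4 <- factanal(x = geopol[, c(1:length(geopol))], factors = 4, rotation = "varimax")  # fit as reported in the Call below
print(fa4, digits=2, cutoff=.5, sort=TRUE)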
##
## Call:
## factanal(x = geopol[, c(1:length(geopol))], factors = 4, rotation = "varimax")
##
## Uniquenesses:
## popu giph ripo rupo rlpo rspo eltp rnnr nunh nuth
## 0.81 0.20 0.02 0.44 0.19 0.32 0.10 0.43 0.02 0.00
##
## Loadings:
##      Factor1 Factor2 Factor3 Factor4
## ripo   -0.90
## rlpo   -0.72
## eltp    0.78
## rnnr    0.65
## giph            0.64
## nunh            0.91
## rupo                    0.55
## rspo                    0.55
## nuth                            0.71
## popu
##
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.97    1.99    1.31    1.20
## Proportion Var    0.30    0.20    0.13    0.12
## Cumulative Var    0.30    0.50    0.63    0.75
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 13.62 on 11 degrees of freedom.
## The p-value is 0.255
We can see that the 4 factors explain about 75% of the variability in the data, which is quite satisfactory. The p-value of the formal test of whether 4 factors are sufficient is about 0.25, so the hypothesis is not rejected at the usual significance levels. However, the test assumes normality and the data are far from normal, so this result should be interpreted with caution.
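The summary figures quoted above can be checked by hand. The following is only a small sketch that re-derives them from the rounded values printed in the output. The variable names (`p`, `k`, `ss`) are introduced here for illustration, and the degrees-of-freedom expression is the standard one for the likelihood-ratio test of a k-factor model on p variables.

```r
# Values copied from the factanal output above (rounded to 2 digits).
p  <- 10                          # number of observed variables
k  <- 4                           # number of factors
ss <- c(2.97, 1.99, 1.31, 1.20)   # SS loadings per factor

# Each standardized variable has variance 1, so the total variance is p.
prop_var <- ss / p
cum_var  <- cumsum(prop_var)
print(round(cum_var, 2))

# Degrees of freedom of the test that k factors are sufficient.
df <- ((p - k)^2 - (p + k)) / 2
print(df)

# Upper-tail chi-square probability for the reported statistic.
p_value <- pchisq(13.62, df = df, lower.tail = FALSE)
print(round(p_value, 3))
```

The degrees of freedom come out as 11, matching the output, and the last cumulative proportion rounds to the 75% mentioned in the text.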
Let’s continue with a comment about the loadings.
The first factor correlates mainly with the variables ripo, rlpo, eltp and rnnr, which represent the rate of population increase, the rate of illiteracy, the expected lifetime and the rate of nutritional needs. The loadings are negative for ripo and rlpo and positive for eltp and rnnr, so the factor resembles the Human Development Index: developing countries score low on it.
The second factor correlates strongly with nunh (0.91), which represents the number of magazines, and moderately with giph (0.64), which represents the gross internal product. It can therefore be read as a measure of the overall economic performance of the country.
The third factor correlates moderately with rupo and rspo, which represent the rate of urban population and the rate of students. This factor does not seem to have a clear interpretation.
The fourth factor is mostly correlated with nuth, which represents the number of televisions. It might therefore measure, to some extent, the level of comfort of the citizens.
The loadings on the individual factors are shown in the figure below. Red denotes negative correlation, blue denotes positive correlation.
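A plot of this kind could be drawn roughly as follows. This is only a sketch, not necessarily how the original figure was produced: the matrix `L` is rebuilt by hand from the loadings printed above, so every loading below the 0.5 cutoff is unknown and left as `NA` (blank in the plot). With the fitted object at hand, one would presumably use `unclass(loadings(fa4))` instead.

```r
# Loadings above the cutoff, copied from the printed output.
# Cells that were not printed are unknown and stay NA.
vars <- c("ripo", "rlpo", "eltp", "rnnr", "giph",
          "nunh", "rupo", "rspo", "nuth", "popu")
L <- matrix(NA_real_, nrow = 10, ncol = 4,
            dimnames = list(vars, paste0("Factor", 1:4)))
L[cbind(1:9, c(1, 1, 1, 1, 2, 2, 3, 3, 4))] <-
  c(-0.90, -0.72, 0.78, 0.65, 0.64, 0.91, 0.55, 0.55, 0.71)

# Diverging palette: red for negative, white near zero, blue for positive.
cols <- colorRampPalette(c("red", "white", "blue"))(21)

# image() expects z with one row per x value, hence the transpose;
# rows are reversed so that the first variable appears at the top.
image(x = 1:4, y = 1:10, z = t(L[10:1, ]), zlim = c(-1, 1),
      col = cols, axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:4, labels = colnames(L))
axis(2, at = 1:10, labels = rev(vars), las = 1)
box()
```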
Finally, we will compare the results with PCA. Let’s look at the loadings of the first three components, which together explain about 80% of the variability.
We can see that PCA also groups ripo, rlpo and eltp together, but it does not group rnnr with them as well. The pairs giph, nunh and rupo, rspo are also grouped very well. Even in PCA, nuth seems to form its own group.
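For reference, a PCA comparison of this kind might be set up as in the sketch below. Since the geopol data are not included here, the built-in `USArrests` data frame is used purely as a stand-in. Scaling the variables (`scale. = TRUE`) is also an assumption, chosen so that the PCA works on the correlation matrix like the factor analysis does.

```r
# Stand-in data (assumption): replace `dat` with the geopol data frame.
dat <- USArrests

# PCA on standardized variables, i.e. on the correlation matrix.
pca <- prcomp(dat, scale. = TRUE)

# Cumulative proportion of variance explained by the components.
cum <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum, 2))

# Loadings (eigenvectors) of the first three components.
print(round(pca$rotation[, 1:3], 2))
```

Variables whose loadings on the same component are large and of similar size are the ones PCA "groups together" in the sense used above.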